
Some content and inspiration come from the R Markdown files provided in class by Javier Nogales.

The next cell loads the packages that are used repeatedly throughout the document.

library(tidyverse) 
library(lubridate) 
library(tsibble) 
library(tidymodels) 
library(modeltime) 
library(modeltime.ensemble) 
library(timetk) 
library(Quandl)
library(urca)
library(forecast)
library(bigtime)

1. Data preprocessing and visualization tools

1.1 Data Acquisition

The data set is obtained from Blockchain, a website that publishes daily-updated data related to Bitcoin. The data is published on Quandl, so we can access it through an API.

The goal of the project is to predict the bitcoin market price, employing past values of the series itself as well as other variables. These features were selected beforehand because they could ease the prediction of future values. The selected predictors are:

  • volume: Bitcoin USD Exchange Trade Volume.
  • trans: Bitcoin Number of Transactions.
  • trans_cost: Bitcoin Cost Per Transaction.

The target is:

  • price: Bitcoin Market Price USD.

Once we have identified the target and the predictor variables, we need to look up the corresponding Quandl tokens for each time series. They will be renamed afterwards according to the variable names:
tokens = c("BCHAIN/MKPRU",    # Bitcoin Market Price USD
           "BCHAIN/TRVOU",    # Bitcoin USD Exchange Trade Volume
           "BCHAIN/NTRAN",    # Bitcoin Number of Transactions
           "BCHAIN/CPTRA"     # Bitcoin Cost Per Transaction
)

names = c("date", "price", "volume", "trans", "trans_cost")

With the corresponding tokens, we are able to call the API and collect all the data. Here we employ data from 2021-01-01 to 2022-03-18. The frequency is daily, so this interval gives us enough samples.

Quandl.api_key("7ee5wB3N-fu3EqTcth8L") # in case you exceed the number of API requests
data = Quandl(tokens, type="raw",
                 start_date = "2021-01-01", 
                 end_date = "2022-03-18")

bitcoin = as_tibble(data)

The columns are renamed to the more understandable names defined above.

colnames(bitcoin) = names

Some functions and visualization tools require the data in long format, with all the variables stacked as rows and identified by their names.

bitcoin_by_var = pivot_longer(bitcoin, c(2:length(names)),
                               names_to = "var",
                               values_to = "value")

Finally, we are able to visualize the multiple time series.

bitcoin_by_var %>% 
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  labs(title = "Daily Values",
       x = "", y="variables") +
  facet_grid(vars(var), scales = "free_y") +
  theme_bw() 

1.2 Pre-processing

Before applying any statistical or machine learning model, it is advisable to apply some preprocessing steps or check if any data transformation is required.

1.2.1 Missing values

First, we have to check if there is any missing value.

bitcoin_by_var %>% 
  summarise(na = sum(is.na(value)))
## # A tibble: 1 × 1
##      na
##   <int>
## 1     0

In this case, there are no missing values.

1.2.2 ACF and PACF

With the ACF and PACF plots of the variables, we can detect whether the data is stationary and whether there is any seasonal pattern.

bitcoin_by_var %>%
    group_by(var) %>%
    plot_acf_diagnostics(date, value, .lags = 14)

As can be seen, the target price does not present any seasonal pattern. However, the ACF plot reveals a high correlation between lags, although only the first and second spikes are relevant in the PACF. For this reason, a first difference of this variable could be worthwhile.

In the other variables, we can appreciate a seasonal pattern every 7 days, i.e. weekly seasonality. This may lead to applying a seasonal difference.
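
As a quick base-R sketch of what a seasonal difference accomplishes (toy simulated data, not the bitcoin series): a lag-7 difference largely removes a period-7 pattern from the ACF.

```r
# Toy illustration: a lag-7 difference removes a weekly (period-7) pattern.
set.seed(1)
weekly = rep(c(5, 3, 1, 0, 2, 4, 6), times = 20)  # deterministic period-7 signal
y = weekly + rnorm(140, sd = 0.1)                 # signal plus small noise
d7 = diff(y, lag = 7)                             # seasonal (lag-7) difference

# autocorrelation at lag 7, before and after differencing
acf_y  = acf(y,  lag.max = 7, plot = FALSE)$acf[8]
acf_d7 = acf(d7, lag.max = 7, plot = FALSE)$acf[8]
abs(acf_d7) < abs(acf_y)  # TRUE: the lag-7 spike shrinks
```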

1.2.3 STL decomposition

If we look at the STL decomposition of the target price, we can see that most of the variance cannot be explained by the seasonal and trend components: almost all the fluctuation is left in the remainder.

plot_stl_diagnostics(bitcoin, .date_var = date, .value = price)

In spite of that, the weekly seasonal component is clear, so it is a point to consider in the models.

If the decomposition is compared with another variable, for example with volume, we can clearly see the differences.

plot_stl_diagnostics(bitcoin, .date_var = date, .value = volume)

In this case, the remainder component does not absorb as much of the variability as with price. As the seasonal pattern is also clear here, a seasonal difference should be applied.

1.2.4 Differencing

After examining the correlograms and the STL decomposition, differencing some variables could be worthwhile for the future models; at the very least, these transformations can be kept as new variables.

In the next cell we apply the KPSS test to see whether the target requires differencing.

ts(bitcoin[,"price"]) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.437 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Although the test statistic (0.437) is below the critical values at the 5% and stricter significance levels, it exceeds the 10% critical value, so differencing the target should be considered.

diff(ts(bitcoin[,"price"])) %>% ur.kpss() %>% summary()
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1545 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

After applying the difference, the margin is much larger: the statistic (0.1545) is well below every critical value. We can also compare the previous ACF and PACF plots of price.

bitcoin %>%
    plot_acf_diagnostics(date, price, .lags = 12)

And the new ones.

bitcoin %>%
    mutate(price_diff = difference(price)) %>%   # first difference, date kept
    drop_na(price_diff) %>%
    plot_acf_diagnostics(date, price_diff, .lags = 12)

The same procedure can be applied with the variables. In this case if we compare the correlogram of volume.

bitcoin %>%
    plot_acf_diagnostics(date, volume, .lags = 12)

With the one after applying a weekly seasonal difference.

bitcoin %>%
    mutate(volume_diff = difference(volume, lag = 7)) %>%   # weekly seasonal difference
    drop_na(volume_diff) %>%
    plot_acf_diagnostics(date, volume_diff, .lags = 12)

We can see how the weekly dependence is smoothed.

Finally, these differences are applied and stored as new variables in a new tibble:

  • First difference to the target price.
  • Weekly seasonal difference to the variables: volume, trans and trans_cost.
bitcoin_processed = bitcoin %>% 
  mutate(
    price_diff = difference(price),
    trans_diff = difference(trans, 7), 
    trans_cost_diff = difference(trans_cost, 7),
    volume_diff = difference(volume, 7))

bitcoin_processed
## # A tibble: 442 × 9
##    date        price      volume  trans trans_cost price_diff trans_diff
##    <date>      <dbl>       <dbl>  <dbl>      <dbl>      <dbl>      <dbl>
##  1 2021-01-01 28983.  606524119. 258080      112.         NA          NA
##  2 2021-01-02 29394.  491258038. 297111      107.        411.         NA
##  3 2021-01-03 32195. 1393906626. 359116      100.       2802.         NA
##  4 2021-01-04 33001.  975831415. 373734      104.        805.         NA
##  5 2021-01-05 32035. 1432013253. 354091       99.7      -966.         NA
##  6 2021-01-06 34047. 1026923832. 397384      108.       2012.         NA
##  7 2021-01-07 36860. 1414154053. 401744      112.       2814.         NA
##  8 2021-01-08 39486. 1836614179. 358526      118.       2626.     100446
##  9 2021-01-09 40670. 1794260807. 321389      123.       1184.      24278
## 10 2021-01-10 40241.  711474843. 331865      130.       -430.     -27251
## # … with 432 more rows, and 2 more variables: trans_cost_diff <dbl>,
## #   volume_diff <dbl>

1.2.5 Lagged values

Another important point is whether the target can be explained using lagged values of the predictors or of the target itself. With the next function, a lagged column can be added to a given data set.

lag_transformer_grouped <- function(data, var, lag){
    data %>%
    tk_augment_lags(var, .lags = lag)
}

In addition, it is important to remark that if we want to predict tomorrow's bitcoin price, we do not have the future values of the other indicators (our predictor variables). For this reason, a lag of 1 day is applied to each predictor, to make the prediction scenario more realistic.

The newly created variables are stored, as before, in bitcoin_processed.

bitcoin_processed = bitcoin_processed %>%
  lag_transformer_grouped("trans", 1) %>% 
  lag_transformer_grouped("trans_cost", 1) %>% 
  lag_transformer_grouped("volume", 1)

bitcoin_processed
## # A tibble: 442 × 12
##    date        price      volume  trans trans_cost price_diff trans_diff
##    <date>      <dbl>       <dbl>  <dbl>      <dbl>      <dbl>      <dbl>
##  1 2021-01-01 28983.  606524119. 258080      112.         NA          NA
##  2 2021-01-02 29394.  491258038. 297111      107.        411.         NA
##  3 2021-01-03 32195. 1393906626. 359116      100.       2802.         NA
##  4 2021-01-04 33001.  975831415. 373734      104.        805.         NA
##  5 2021-01-05 32035. 1432013253. 354091       99.7      -966.         NA
##  6 2021-01-06 34047. 1026923832. 397384      108.       2012.         NA
##  7 2021-01-07 36860. 1414154053. 401744      112.       2814.         NA
##  8 2021-01-08 39486. 1836614179. 358526      118.       2626.     100446
##  9 2021-01-09 40670. 1794260807. 321389      123.       1184.      24278
## 10 2021-01-10 40241.  711474843. 331865      130.       -430.     -27251
## # … with 432 more rows, and 5 more variables: trans_cost_diff <dbl>,
## #   volume_diff <dbl>, trans_lag1 <dbl>, trans_cost_lag1 <dbl>,
## #   volume_lag1 <dbl>

1.2.5.1 Multivariate Time-Series Model - VAR

Up to this point we have not examined how the variables are related to each other. A VAR model is useful here to indicate how the variables affect one another.

plot_lag_matrix = function(data){
  VAR.L1 <- sparseVAR(Y=scale(as.matrix(data[,-1])),
                    h = 1, 
                    selection = "cv",
                    VARpen = "L1") 
  LhatL1 = lagmatrix(fit=VAR.L1, returnplot=TRUE)
}

plot_lag_matrix(bitcoin)

Given the matrix, we are only interested in the first row, since we want to predict price.

  • There is a 3-day lagged effect of trans on price.
  • There is a 2-day lagged effect of trans_cost on price.
  • Finally, price has a 7-day lagged effect on itself.

These lagged values are stored as new columns in the processed tibble.

bitcoin_processed = bitcoin_processed %>%
  lag_transformer_grouped("price", 7) %>% 
  lag_transformer_grouped("trans", 3) %>% 
  lag_transformer_grouped("trans_cost", 2)

bitcoin_processed
## # A tibble: 442 × 15
##    date        price      volume  trans trans_cost price_diff trans_diff
##    <date>      <dbl>       <dbl>  <dbl>      <dbl>      <dbl>      <dbl>
##  1 2021-01-01 28983.  606524119. 258080      112.         NA          NA
##  2 2021-01-02 29394.  491258038. 297111      107.        411.         NA
##  3 2021-01-03 32195. 1393906626. 359116      100.       2802.         NA
##  4 2021-01-04 33001.  975831415. 373734      104.        805.         NA
##  5 2021-01-05 32035. 1432013253. 354091       99.7      -966.         NA
##  6 2021-01-06 34047. 1026923832. 397384      108.       2012.         NA
##  7 2021-01-07 36860. 1414154053. 401744      112.       2814.         NA
##  8 2021-01-08 39486. 1836614179. 358526      118.       2626.     100446
##  9 2021-01-09 40670. 1794260807. 321389      123.       1184.      24278
## 10 2021-01-10 40241.  711474843. 331865      130.       -430.     -27251
## # … with 432 more rows, and 8 more variables: trans_cost_diff <dbl>,
## #   volume_diff <dbl>, trans_lag1 <dbl>, trans_cost_lag1 <dbl>,
## #   volume_lag1 <dbl>, price_lag7 <dbl>, trans_lag3 <dbl>,
## #   trans_cost_lag2 <dbl>

After differencing and creating lagged columns, the new variables contain missing values. As there is enough data, these rows are removed so they do not cause problems with the models.

bitcoin_processed = bitcoin_processed %>%
  drop_na()

bitcoin_processed 
## # A tibble: 435 × 15
##    date        price      volume  trans trans_cost price_diff trans_diff
##    <date>      <dbl>       <dbl>  <dbl>      <dbl>      <dbl>      <dbl>
##  1 2021-01-08 39486. 1836614179. 358526       118.      2626.     100446
##  2 2021-01-09 40670. 1794260807. 321389       123.      1184.      24278
##  3 2021-01-10 40241.  711474843. 331865       130.      -430.     -27251
##  4 2021-01-11 38240. 1415450813. 333958       113.     -2001.     -39776
##  5 2021-01-12 35545. 3039065356. 336627       113.     -2695.     -17464
##  6 2021-01-13 34012. 1344114135. 319167       117.     -1533.     -78217
##  7 2021-01-14 37393. 1209077692. 338809       121.      3381.     -62935
##  8 2021-01-15 39158.  988940894. 344002       121.      1765.     -14524
##  9 2021-01-16 36829. 1194986405. 308461       122.     -2330.     -12928
## 10 2021-01-17 36065.  685084802. 271874       117.      -763.     -59991
## # … with 425 more rows, and 8 more variables: trans_cost_diff <dbl>,
## #   volume_diff <dbl>, trans_lag1 <dbl>, trans_cost_lag1 <dbl>,
## #   volume_lag1 <dbl>, price_lag7 <dbl>, trans_lag3 <dbl>,
## #   trans_cost_lag2 <dbl>

Only 7 rows were removed.
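
The count of 7 dropped rows is exactly the largest lag or seasonal difference used (7 days), since a lag-7 column has 7 leading NAs. A small base-R sketch:

```r
# Why exactly 7 rows: each transformed column has as many leading NAs
# as its lag/seasonal period, and the maximum used above is 7.
x = 1:10
diff7 = c(rep(NA, 7), diff(x, lag = 7))   # lag-7 difference: 7 leading NAs
lag7  = c(rep(NA, 7), x[1:(10 - 7)])      # lag-7 column: 7 leading NAs
lag1  = c(NA, x[1:9])                     # lag-1 column: 1 leading NA
complete = !is.na(diff7) & !is.na(lag7) & !is.na(lag1)
sum(!complete)  # 7 rows would be dropped
```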

1.3 Forecasting functions

The next function is employed to get the forecast and prediction intervals for one split. The main arguments are:

  • sp: integer selecting the split.
  • conf: the confidence level; 95% is the default.

Note that the argument back is only used when visualizing the actual data: it indicates how many samples from the training set are plotted.

forecast_results = function(model_table, sp=1, conf=0.95, back=30){
  calibration_table = model_table %>%
    modeltime_calibrate(testing(splits$splits[[sp]]))
  train_idx = splits$splits[[sp]]$in_id
  beg = train_idx[length(train_idx)] - back
  test_idx = splits$splits[[sp]]$out_id
  n_col = ncol(bitcoin_processed)
  fc = calibration_table %>%
    modeltime_forecast(actual_data = bitcoin_processed[beg:test_idx[length(test_idx)], 
                                                       1:n_col],
                       new_data = bitcoin_processed[test_idx, 1:n_col],
                       conf_interval = conf)
  out_list = list("fc"=fc, "calibration"=calibration_table)
  return(out_list)
}

The next functions are used to show the forecast accuracy and to plot the predicted values.

plot_forecast_table = function(calibration_table){
  calibration_table %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(.interactive = FALSE)
}

plot_forecast = function(fc_results){
  fc_results %>%
    plot_modeltime_forecast(.interactive = FALSE) + 
    labs(title = "Forecasts", y = "")
}

1.4 Cross validation - Back testing

The models are evaluated by cross validation, using sliding windows. The next functions generate the corresponding splits and plot them.

cv_split = function(data, 
                    initial="5 months", 
                    assess="1 months", 
                    skip="2 months", 
                    cumulative=FALSE){
  splits = data %>% 
    time_series_cv(date_var = date, 
                   initial = initial,
                   assess = assess,
                   skip = skip,
                   cumulative = cumulative)
  return(splits)
}

plot_cv_splits = function(splits){
  splits %>%
    tk_time_series_cv_plan() %>%
    plot_time_series_cv_plan(date, price, .interactive = FALSE)
}

splits = cv_split(bitcoin_processed)
plot_cv_splits(splits)
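
As a rough sanity check of the split count (an approximation that assumes 30-day months and ignores calendar details of time_series_cv), the number of sliding windows can be estimated from the window sizes:

```r
# Approximate count of sliding-window splits (30-day months assumed).
n_rows  = 435       # rows in bitcoin_processed after drop_na()
initial = 5 * 30    # training window ("5 months")
assess  = 1 * 30    # test window ("1 months")
skip    = 2 * 30    # shift between consecutive windows ("2 months")
n_splits = floor((n_rows - initial - assess) / skip) + 1
n_splits  # 5
```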

There are 5 splits. Since each model is first trained on a single split, the first one is selected.

sp = 1

1.4.1 Re-training functions

These functions are employed to train the models by cross validation, as well as to plot the process and the accuracy table. They are used in the following sections.

training_results = function(model_table, splits){
  resample_results = model_table %>%
    modeltime_fit_resamples(resamples = splits,
                            control = control_resamples(verbose = TRUE)
                            )
  return(resample_results)
}

plot_training_cv = function(resample_results){
  resample_results %>%
    plot_modeltime_resamples(.summary_fn = mean, 
                             .point_size  = 3,
                             .interactive = FALSE
  )
}

plot_training_table = function(resample_results){
  resample_results %>%
    modeltime_resample_accuracy(summary_fns = list(mean = mean)) %>%
    table_modeltime_accuracy(.interactive = FALSE)
}

2. Statistical tools

This section is composed of simple models, where only the target is used, and dynamic regression, in which the rest of the variables are employed as well.

2.1 Automatic forecasting using modeltime

We start with simple models built with modeltime. Only the target price is employed, so they can serve as benchmarks. ARIMA and Prophet are selected.

2.1.1 Models definition

2.1.1.1 ARIMA

The first implementation corresponds to an auto-ARIMA.

mod_arima = 
  arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(price ~ date, 
      training(splits$splits[[sp]]))

As explained in the preprocessing steps, first and seasonal differences of price may be worthwhile. In addition, the significant first spike in the PACF translates into \(p=1\) in the ARIMA.

Note that the seasonal MA term is included because it improves the BIC.

mod_arima_custom = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date, 
      training(splits$splits[[sp]])) 
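
The arima_reg() arguments above correspond to the classical order notation ARIMA(1,1,0)(0,1,1)[7]. As an illustrative sketch with base R only, the same orders can be passed directly to stats::arima() (shown here on a simulated series, since the orders are what matters):

```r
# Same order specification with base R's stats::arima(), on simulated data.
set.seed(42)
y = cumsum(rnorm(300))  # a random-walk-like series standing in for price
fit = arima(y,
            order    = c(1, 1, 0),                            # p, d, q
            seasonal = list(order = c(0, 1, 1), period = 7))  # P, D, Q with m = 7
names(coef(fit))  # "ar1" "sma1": one AR term and one seasonal MA term
```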

2.1.1.2 Prophet

First, the default Prophet model is employed.

mod_prophet = 
  prophet_reg() %>%
  set_engine("prophet") %>%
  fit(price ~ date, 
      training(splits$splits[[sp]]))

Second, some parameters are set explicitly to see whether the performance improves.

mod_prophet_custom = 
  prophet_reg(
    growth               = "linear",
    season               = "additive",
    seasonality_daily    = "auto"
  ) %>%
  set_engine("prophet") %>%
  fit(price ~ date, 
      training(splits$splits[[sp]]))

Once the models are defined, they are stored in a modeltime_table.

autoMod_table = modeltime_table(
  mod_arima, 
  mod_arima_custom,
  mod_prophet, 
  mod_prophet_custom
) 

autoMod_table
## # Modeltime Table
## # A tibble: 4 × 3
##   .model_id .model   .model_desc           
##       <int> <list>   <chr>                 
## 1         1 <fit[+]> ARIMA(0,1,0)(2,0,0)[7]
## 2         2 <fit[+]> ARIMA(1,1,0)(0,1,1)[7]
## 3         3 <fit[+]> PROPHET               
## 4         4 <fit[+]> PROPHET

2.1.2 Forecasting with the automatic models

The first step is obtaining the forecast data and the calibration table for all the models.

autoForecast = forecast_results(autoMod_table)

The calibration table in which all models are stored can be accessed as:

autoForecast$calibration
## # Modeltime Table
## # A tibble: 4 × 5
##   .model_id .model   .model_desc            .type .calibration_data
##       <int> <list>   <chr>                  <chr> <list>           
## 1         1 <fit[+]> ARIMA(0,1,0)(2,0,0)[7] Test  <tibble [31 × 4]>
## 2         2 <fit[+]> ARIMA(1,1,0)(0,1,1)[7] Test  <tibble [31 × 4]>
## 3         3 <fit[+]> PROPHET                Test  <tibble [31 × 4]>
## 4         4 <fit[+]> PROPHET                Test  <tibble [31 × 4]>

The forecast values can be visualized as:

idx = 100:110
autoForecast$fc[idx,]
## # A tibble: 11 × 7
##    .model_id .model_desc            .key     .index     .value .conf_lo .conf_hi
##        <int> <chr>                  <fct>    <date>      <dbl>    <dbl>    <dbl>
##  1         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-22 42372.   36803.   47940.
##  2         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-23 42266.   36698.   47835.
##  3         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-24 42338.   36770.   47906.
##  4         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-25 41955.   36386.   47523.
##  5         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-26 41701.   36132.   47269.
##  6         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-27 41844.   36276.   47413.
##  7         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-02-28 42121.   36552.   47689.
##  8         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-01 42132.   36564.   47701.
##  9         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-02 42027.   36459.   47596.
## 10         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-03 42099.   36530.   47667.
## 11         2 ARIMA(1,1,0)(0,1,1)[7] predict… 2022-03-04 41715.   36147.   47284.

We are interested in:

  • .value: the value being forecasted.
  • .conf_lo: the lower limit of the prediction interval.
  • .conf_hi: the upper limit of the prediction interval.

The test accuracy is shown with the next function.

plot_forecast_table(autoForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ARIMA(0,1,0)(2,0,0)[7] Test 3348.60 8.58 2.55 8.16 3700.40 0.36
2 ARIMA(1,1,0)(0,1,1)[7] Test 2499.28 6.39 1.90 6.16 2818.08 0.05
3 PROPHET Test 2995.19 7.27 2.28 7.66 3632.76 0.02
4 PROPHET Test 2995.19 7.27 2.28 7.66 3632.76 0.02

The ARIMA model whose orders were introduced manually obtains the best score. Finally, we can plot the forecast values for the selected split (sp=1), as well as the prediction intervals.

plot_forecast(autoForecast$fc)

2.1.3 Re-train automatic models

Given the modeltime_table, the training function can be called. First, the models are trained and evaluated using all the splits.

autoMod_resample_results = training_results(autoMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 13.933 sec elapsed

We can easily plot the accuracy results in every split.

plot_training_cv(autoMod_resample_results)

As shown, the ARIMA models generally perform better than Prophet in every split. The mean accuracy is plotted with the next function.

plot_training_table(autoMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 ARIMA(0,1,0)(2,0,0)[7] Resamples 5 3550.09 8.06 3.16 8.03 4040.96 NA
2 ARIMA(1,1,0)(0,1,1)[7] Resamples 5 3553.95 7.16 2.86 6.81 4145.97 0.21
3 PROPHET Resamples 5 5129.00 12.17 4.92 13.04 5886.67 0.25
4 PROPHET Resamples 5 5129.00 12.17 4.92 13.04 5886.67 0.25

As expected, the ARIMA models outperform Prophet. However, note that the automatic ARIMA and the manual one perform similarly here, even though the manual specification generally provides better results and showed a significant advantage in the single-split forecast.

2.2 Automatic dynamic regression

Now the predictors are incorporated into a dynamic regression with ARIMA.

2.2.1 Models definition

The first model employs only the original predictors. Note that this is not a very realistic scenario since, as mentioned before, the future values of these predictors are not known when we want to predict the future target.

dynamic_arima1 = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date + trans + trans_cost + volume, 
      training(splits$splits[[sp]]))

In the second model the same idea is applied, but employing the 1-day lagged values of the predictors.

dynamic_arima2 = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date + trans_lag1 + trans_cost_lag1 + volume_lag1, 
      training(splits$splits[[sp]]))

In this case we use the lags suggested by the VAR model. Note that the lag corresponding to price is not used, since that part is modeled by the ARIMA.

dynamic_arima3 = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date + trans_lag3 + trans_cost_lag2, 
      training(splits$splits[[sp]]))

This model employs the differenced values. As before, the one corresponding to price is not used, since that is the role of the ARIMA.

dynamic_arima4 = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date + trans_diff + trans_cost_diff + volume_diff, 
      training(splits$splits[[sp]]))

This model employs a mixed combination of variables.

dynamic_arima5 = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date  + volume_lag1  + trans_cost_lag2 + trans_diff,
      training(splits$splits[[sp]]))

The last one considers all the available predictors.

dynamic_arima6 = 
  arima_reg(
    seasonal_period          = "auto",
    non_seasonal_ar          = 1,
    non_seasonal_differences = 1,
    non_seasonal_ma          = 0,
    seasonal_ar              = 0,
    seasonal_differences     = 1,
    seasonal_ma              = 1
    ) %>%
  set_engine("arima") %>%
  fit(price ~ date + trans_lag1 + trans_cost_lag1 + volume_lag1 +
        trans_lag3 + trans_cost_lag2 +
        trans_diff + trans_cost_diff + volume_diff,
      training(splits$splits[[sp]]))

The evaluation process is the same as before. The first step is saving all the models in a modeltime_table.

dynMod_table = modeltime_table(
  dynamic_arima1,
  dynamic_arima2,
  dynamic_arima3,
  dynamic_arima4,
  dynamic_arima5,
  dynamic_arima6
) 

dynMod_table
## # Modeltime Table
## # A tibble: 6 × 3
##   .model_id .model   .model_desc                                  
##       <int> <list>   <chr>                                        
## 1         1 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 2         2 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 3         3 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 4         4 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 5         5 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS
## 6         6 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS

2.2.2 Forecasting with the dynamic regression models

As before, we can get the accuracy table with the defined functions.

dynForecast = forecast_results(dynMod_table)
plot_forecast_table(dynForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 2095.07 5.30 1.60 5.19 2307.57 0.17
2 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 2172.84 5.54 1.65 5.38 2448.70 0.24
3 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 2526.46 6.46 1.92 6.23 2849.38 0.03
4 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 2293.36 5.85 1.75 5.67 2576.88 0.18
5 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 2482.06 6.35 1.89 6.12 2800.30 0.14
6 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 1544.31 3.88 1.18 3.84 1796.65 0.43

The results are very similar, except for the last model, which significantly outperforms the rest. However, this one considers all the variables and may overfit.

plot_forecast(dynForecast$fc)

Looking at the forecast series, there are no significant differences between the models.

2.2.3 Re-train dynamic regression models

Using all the splits.

dynMod_resample_results = training_results(dynMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 27.733 sec elapsed
plot_training_cv(dynMod_resample_results)

All the models perform similarly across the splits. Let us look at the average accuracy results.

plot_training_table(dynMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 2659.22 5.40 2.20 5.26 3135.23 0.19
2 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 3225.13 6.62 2.64 6.34 3825.89 0.22
3 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 3434.56 6.92 2.77 6.58 4021.33 0.19
4 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 3836.31 7.73 3.10 7.34 4482.92 0.18
5 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 3182.75 6.41 2.60 6.16 3760.46 0.15
6 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 3376.27 7.06 2.84 6.74 3970.77 0.19

As expected, the best model is the one that receives the current values of the predictors. Nevertheless, as mentioned before, that is not a realistic scenario.

However, the ARIMA using the one-day lagged values as predictors also achieves good performance, as does the one using a mixture of lagged and differenced variables.

2.3 Comparison of the best statistical models

In this part we consider the statistical models that provided the best forecasting results, since the forecast covers the most recent data, which is what we are most interested in. The selected models are:

  • The ARIMA whose orders were introduced manually.
  • The dynamic regression combining ARIMA and all the predictors.
staMod_table = modeltime_table(
  mod_arima_custom,
  dynamic_arima6
) 

staMod_table
## # Modeltime Table
## # A tibble: 2 × 3
##   .model_id .model   .model_desc                                  
##       <int> <list>   <chr>                                        
## 1         1 <fit[+]> ARIMA(1,1,0)(0,1,1)[7]                       
## 2         2 <fit[+]> REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS

2.3.1 Forecasting with the best statistical models

staForecast = forecast_results(staMod_table)
plot_forecast_table(staForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ARIMA(1,1,0)(0,1,1)[7] Test 2499.28 6.39 1.90 6.16 2818.08 0.05
2 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Test 1544.31 3.88 1.18 3.84 1796.65 0.43

As we have seen, the dynamic regression outperforms the simple ARIMA.

plot_forecast(staForecast$fc)

However, the forecasts and prediction intervals are similar.

2.3.2 Cross validation of the best statistical models

staMod_resample_results = training_results(staMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 7.574 sec elapsed
plot_training_cv(staMod_resample_results)

The performance of the model with regressors is better than that of the simple ARIMA.

plot_training_table(staMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 ARIMA(1,1,0)(0,1,1)[7] Resamples 5 3553.95 7.16 2.86 6.81 4145.97 0.21
2 REGRESSION WITH ARIMA(1,1,0)(0,1,1)[7] ERRORS Resamples 5 3376.27 7.06 2.84 6.74 3970.77 0.19

As we can see, adding the predictors in a dynamic regression improves on the results of the simple ARIMA model.

3. Machine learning tools

The goal is to predict the target using all the available features. With machine learning, all the time series, both the target and the other variables, are used as predictors of the future value of the response.

3.1 Pre-processing

Before applying machine learning models, some pre-processing steps have to be carried out. In this case a recipe is created, in which:

  • Some variables created by default are deleted.
  • The current values of volume, trans and trans_cost are removed, to make the prediction more realistic, since these values would not be known at forecast time.
  • The rest of the variables are normalized.
recipe_spec = recipe(price ~ ., 
                     training(splits$splits[[sp]])) %>%
  step_timeseries_signature(date) %>%
  step_rm(contains("date_")) %>%
  step_rm("volume", "trans", "trans_cost") %>%
  step_normalize(price_diff, trans_diff, trans_cost_diff, volume_diff, 
                 trans_lag1, trans_cost_lag1, volume_lag1, 
                 price_lag7, trans_lag3, trans_cost_lag2)

recipe_spec %>% prep() %>% juice() %>% head()
## # A tibble: 6 × 12
##   date       price_diff trans_diff trans_cost_diff volume_diff trans_lag1
##   <date>          <dbl>      <dbl>           <dbl>       <dbl>      <dbl>
## 1 2021-09-18     -0.284      0.504           0.540    -0.463        0.270
## 2 2021-09-19      0.601      0.428           0.825     0.154       -1.13 
## 3 2021-09-20     -0.569      0.780          -0.170    -0.00766     -2.01 
## 4 2021-09-21     -2.51      -1.29           -1.62      0.828        0.244
## 5 2021-09-22     -1.31       0.344          -0.984     1.87        -0.585
## 6 2021-09-23      1.76      -0.137           0.638     0.457        0.406
## # … with 6 more variables: trans_cost_lag1 <dbl>, volume_lag1 <dbl>,
## #   price_lag7 <dbl>, trans_lag3 <dbl>, trans_cost_lag2 <dbl>, price <dbl>
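The lagged and differenced columns referenced in the recipe (price_diff, trans_lag1, price_lag7, etc.) were created earlier in the document. As a reminder, a minimal sketch of how such features can be built with timetk is shown below; the data frame name `bitcoin` and the exact set of lags are illustrative, not copied from the original pre-processing chunk.

```r
# Sketch (assumptions labeled): build differenced and lagged features with timetk.
# Assumes a data frame `bitcoin` with columns date, price, volume, trans, trans_cost.
bitcoin_feat = bitcoin %>%
  mutate(
    price_diff      = diff_vec(price, lag = 1),      # first difference of the target
    volume_diff     = diff_vec(volume, lag = 1),
    trans_diff      = diff_vec(trans, lag = 1),
    trans_cost_diff = diff_vec(trans_cost, lag = 1),
    price_lag7      = lag_vec(price, lag = 7),       # weekly lag of the target
    trans_lag1      = lag_vec(trans, lag = 1),
    volume_lag1     = lag_vec(volume, lag = 1)
  ) %>%
  drop_na()                                          # drop rows lost to differencing/lagging
```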

3.2 Models definition

Different models are going to be fitted and evaluated to predict the bitcoin value. The options are elastic net, random forest, XGBoost and the neural network NNETAR.

In order to find the best hyper-parameters, different options are tried manually. Note that an automatic hyper-parameter search using cross validation, which would be the optimal approach, could not be implemented due to several errors during the process. One possible cause is the use of several predictor variables rather than only the target.
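For reference, the intended automatic search would look roughly like the following sketch with the tune package. The grid size and the assumption that `splits` can be passed directly as an rset of time-series CV resamples are illustrative, not taken from the original analysis.

```r
# Sketch of an automatic hyper-parameter search (illustrative, not run in this report).
# Assumes `splits` is a time-series CV rset and `recipe_spec` is the recipe above.
mod_glmnet_tune = linear_reg(
  penalty = tune(),                 # placeholders to be filled by the search
  mixture = tune()
) %>%
  set_engine("glmnet")

wflw_tune = workflow() %>%
  add_model(mod_glmnet_tune) %>%
  add_recipe(recipe_spec %>% step_rm(date))

tune_results = tune_grid(
  wflw_tune,
  resamples = splits,               # the time-series CV splits defined earlier
  grid      = 20,                   # 20 candidate (penalty, mixture) pairs
  metrics   = metric_set(rmse, mae)
)

show_best(tune_results, metric = "rmse")
```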

3.2.1 Elastic net

The considered hyper-parameters are:

  • penalty: a non-negative number representing the total amount of regularization.
  • mixture: a number between zero and one that is the proportion of L1 regularization (lasso) in the model.

The idea is to penalize models which use a lot of variables. The first one is an equal mixture of lasso and ridge regression.

mod_glmnet1 = 
  linear_reg(
    penalty = 0.1, 
    mixture = 0.5
    ) %>%
  set_engine("glmnet")

wflw_glmnet1 = workflow() %>%
  add_model(mod_glmnet1) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

The second one has a larger value of mixture, so the model will be close to a lasso regression.

mod_glmnet2 = 
  linear_reg(
    penalty = 0.1, 
    mixture = 0.9
    ) %>%
  set_engine("glmnet")

wflw_glmnet2 = workflow() %>%
  add_model(mod_glmnet2) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

In contrast, the third one is close to a ridge regression.

mod_glmnet3 = 
  linear_reg(
    penalty = 0.1, 
    mixture = 0.1
    ) %>%
  set_engine("glmnet")

wflw_glmnet3 = workflow() %>%
  add_model(mod_glmnet3) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))
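The effect of mixture can be inspected directly by extracting the fitted coefficients from the workflows. As a hedged check (not part of the original analysis), the lasso-leaning fit (mixture = 0.9) should shrink more coefficients exactly to zero than the ridge-leaning one (mixture = 0.1):

```r
# Count coefficients shrunk exactly to zero in each fit (sketch)
wflw_glmnet2 %>% extract_fit_parsnip() %>% tidy() %>% filter(estimate == 0)  # lasso-leaning
wflw_glmnet3 %>% extract_fit_parsnip() %>% tidy() %>% filter(estimate == 0)  # ridge-leaning
```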

As with the statistical models, they are stored in a modeltime_table.

glmnetMod_table = modeltime_table(
  wflw_glmnet1,
  wflw_glmnet2,
  wflw_glmnet3
) 
glmnetMod_table
## # Modeltime Table
## # A tibble: 3 × 3
##   .model_id .model     .model_desc
##       <int> <list>     <chr>      
## 1         1 <workflow> GLMNET     
## 2         2 <workflow> GLMNET     
## 3         3 <workflow> GLMNET

3.2.1.1 Forecasting with the elastic net models

glmnetForecast = forecast_results(glmnetMod_table)
plot_forecast_table(glmnetForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 GLMNET Test 1544.49 3.86 1.18 3.87 1840.15 0.67
2 GLMNET Test 1545.39 3.86 1.18 3.87 1841.81 0.67
3 GLMNET Test 1537.22 3.84 1.17 3.85 1833.10 0.68

The performance is almost the same in the three models.

plot_forecast(glmnetForecast$fc)

As expected, the predictions are similar.

3.2.1.2 Re-train of the elastic net models

The evaluation of the models is carried out as before. The goal is to find the best model according to the selected hyper-parameters.

glmnetMod_resample_results = training_results(glmnetMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 11.179 sec elapsed
plot_training_cv(glmnetMod_resample_results)

As in the forecasting, the resample performance of the three models is almost identical.

plot_training_table(glmnetMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 GLMNET Resamples 5 2118.08 4.93 2.02 4.93 2473.77 0.37
2 GLMNET Resamples 5 2117.04 4.92 2.01 4.93 2472.11 0.37
3 GLMNET Resamples 5 2120.50 4.93 2.02 4.93 2477.99 0.37

As expected, the three models obtain very close results.

3.2.2 Random forest

In this case, the selected hyper-parameters are:

  • mtry: an integer for the number of predictors that will be randomly sampled at each split when creating the tree models.
  • trees: an integer for the number of trees contained in the ensemble.
  • min_n: an integer for the minimum number of data points in a node that are required for the node to be split further.

In order to get reproducible results, the seed is fixed, since this model introduces randomness.

We start with a low value for the number of trees.

set.seed(123)
mod_rf1 = 
  rand_forest(
    mtry = 5,
    trees = 50,
    min_n = 5,
    ) %>%
  set_engine("randomForest")

wflw_rf1 = workflow() %>%
  add_model(mod_rf1) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

The number of trees is increased.

set.seed(123)
mod_rf2 = 
  rand_forest(
    mtry = 5,
    trees = 100,
    min_n = 5,
    ) %>%
  set_engine("randomForest")

wflw_rf2 = workflow() %>%
  add_model(mod_rf2) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

The number of trees is increased again.

set.seed(123)
mod_rf3 = 
  rand_forest(
    mtry = 5,
    trees = 200,
    min_n = 5,
    ) %>%
  set_engine("randomForest")

wflw_rf3 = workflow() %>%
  add_model(mod_rf3) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

Store them in the model table.

rfMod_table = modeltime_table(
  wflw_rf1,
  wflw_rf2,
  wflw_rf3
) 
rfMod_table
## # Modeltime Table
## # A tibble: 3 × 3
##   .model_id .model     .model_desc 
##       <int> <list>     <chr>       
## 1         1 <workflow> RANDOMFOREST
## 2         2 <workflow> RANDOMFOREST
## 3         3 <workflow> RANDOMFOREST

3.2.2.1 Forecasting with the random forest models

rfForecast = forecast_results(rfMod_table)
plot_forecast_table(rfForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 RANDOMFOREST Test 2354.59 5.99 1.79 5.74 2854.09 0.36
2 RANDOMFOREST Test 2241.57 5.73 1.71 5.49 2798.66 0.36
3 RANDOMFOREST Test 2294.46 5.86 1.75 5.61 2845.40 0.36

Again, the models behave similarly.

plot_forecast(rfForecast$fc)

And the predictions are very close.

3.2.2.2 Re-train of the random forest models

The evaluation procedure is as usual, re-training using all the splits.

rfMod_resample_results = training_results(rfMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 11.55 sec elapsed
plot_training_cv(rfMod_resample_results)

The models perform very similarly.

plot_training_table(rfMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 RANDOMFOREST Resamples 5 2797.47 6.29 2.49 6.19 3317.35 0.24
2 RANDOMFOREST Resamples 5 2883.14 6.58 2.60 6.43 3441.51 0.23
3 RANDOMFOREST Resamples 5 2730.33 6.25 2.46 6.11 3319.51 0.21

As can be seen, the number of trees is not very relevant, since changing this hyper-parameter does not lead to a significant change in performance.
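This can be checked directly: the underlying randomForest object stores the out-of-bag error as trees are added, and plotting it (a quick diagnostic, not part of the original analysis) typically shows the error flattening well before 200 trees:

```r
# Out-of-bag error as a function of the number of trees (sketch)
rf_fit = wflw_rf3 %>% extract_fit_engine()   # pull the raw randomForest object
plot(rf_fit, main = "OOB error vs. number of trees")
```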

3.2.3 XGBoost

The employed hyper-parameters are:

  • trees: as before, an integer for the number of trees contained in the ensemble.
  • min_n: as before, an integer for the minimum number of data points in a node that is required for the node to be split further.
  • tree_depth: an integer for the maximum depth of the tree.
mod_xgboost1 = 
  boost_tree(
    trees = 50,
    min_n = 20,
    tree_depth = 6
  ) %>%
  set_engine("xgboost")

wflw_xgboost1 = workflow() %>%
  add_model(mod_xgboost1) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

min_n is decreased while tree_depth is increased.

mod_xgboost2 = 
  boost_tree(
    trees = 50,
    min_n = 10,
    tree_depth = 7
  ) %>%
  set_engine("xgboost")

wflw_xgboost2 = workflow() %>%
  add_model(mod_xgboost2) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

The same step is applied again.

mod_xgboost3 = 
  boost_tree(
    trees = 50,
    min_n = 5,
    tree_depth = 8
  ) %>%
  set_engine("xgboost")

wflw_xgboost3 = workflow() %>%
  add_model(mod_xgboost3) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits$splits[[sp]]))

Store models in the table.

xgMod_table = modeltime_table(
  wflw_xgboost1,
  wflw_xgboost2,
  wflw_xgboost3
) 
xgMod_table
## # Modeltime Table
## # A tibble: 3 × 3
##   .model_id .model     .model_desc
##       <int> <list>     <chr>      
## 1         1 <workflow> XGBOOST    
## 2         2 <workflow> XGBOOST    
## 3         3 <workflow> XGBOOST

3.2.3.1 Forecasting with the XGBoost models

xgForecast = forecast_results(xgMod_table)
plot_forecast_table(xgForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 XGBOOST Test 1561.01 3.88 1.19 3.83 1877.63 0.66
2 XGBOOST Test 1990.02 5.02 1.52 4.85 2553.93 0.49
3 XGBOOST Test 1784.99 4.55 1.36 4.42 2217.33 0.37

The first model outperforms the other two.

plot_forecast(xgForecast$fc)

In the forecast plot we can see how the red model is closer to the actual data.

3.2.3.2 Re-train of the XGBoost models

Training and evaluation process.

xgMod_resample_results = training_results(xgMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 11.502 sec elapsed
plot_training_cv(xgMod_resample_results)

The third model outperforms the other two in most of the splits and metrics.

plot_training_table(xgMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 XGBOOST Resamples 5 2878.20 6.14 2.52 6.18 3447.91 0.25
2 XGBOOST Resamples 5 2818.00 6.26 2.52 6.23 3467.26 0.15
3 XGBOOST Resamples 5 2453.09 5.62 2.27 5.54 3102.39 0.16

As expected, the best result is obtained by the third model. In this case min_n and tree_depth have a clear impact on the performance.

Note the difference between the initial forecast and the subsequent re-training: now the best model is the third one and the worst is the first, just the other way round.

3.2.4 Neural network - NNETAR

A neural network is the most complex model employed in this work. The selected hyper-parameters are:

  • non_seasonal_ar: the order of the non-seasonal auto-regressive (AR) terms. In this case we selected the same value as in the ARIMA models.
  • seasonal_ar: the order of the seasonal auto-regressive (SAR) terms. Again, the same value as in the ARIMA models.
  • hidden_units: an integer for the number of units in the hidden layer.
  • penalty: a non-negative numeric value for the amount of weight decay.
  • epochs: an integer for the number of training iterations.

Note that the initialization of the weights in the neural network is random, so the performance changes in every execution. That is why the seed is fixed again here.

set.seed(123)
mod_nnetar1 = 
  nnetar_reg(
    non_seasonal_ar = 1,
    seasonal_ar = 0,
    hidden_units = 10,
    penalty = 0.1,
    epochs = 10,
    ) %>%
  set_engine("nnetar")

wflw_nnetar1 = workflow() %>%
  add_model(mod_nnetar1) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits$splits[[sp]]))

The number of hidden units is increased. In order to avoid overfitting, the penalty is increased and the number of epochs reduced.

set.seed(123)
mod_nnetar2 = 
  nnetar_reg(
    seasonal_period = "auto",
    non_seasonal_ar = 1,
    seasonal_ar = 0,
    hidden_units = 20,
    penalty = 0.2,
    epochs = 5,
    ) %>%
  set_engine("nnetar")

wflw_nnetar2 = workflow() %>%
  add_model(mod_nnetar2) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits$splits[[sp]]))

The number of epochs is reduced again, while hidden_units increases slightly.

set.seed(123)
mod_nnetar3 = 
  nnetar_reg(
    seasonal_period = "auto",
    non_seasonal_ar = 1,
    seasonal_ar = 0,
    hidden_units = 25,
    penalty = 0.2,
    epochs = 3,
    ) %>%
  set_engine("nnetar")

wflw_nnetar3 = workflow() %>%
  add_model(mod_nnetar3) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits$splits[[sp]]))

The three networks are stored.

set.seed(123)
nnMod_table = modeltime_table(
  wflw_nnetar1,
  wflw_nnetar2,
  wflw_nnetar3
) 
nnMod_table
## # Modeltime Table
## # A tibble: 3 × 3
##   .model_id .model     .model_desc
##       <int> <list>     <chr>      
## 1         1 <workflow> NNAR(1,10) 
## 2         2 <workflow> NNAR(1,20) 
## 3         3 <workflow> NNAR(1,25)

3.2.4.1 Forecasting with the NNETAR models

nnForecast = forecast_results(nnMod_table)
plot_forecast_table(nnForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 NNAR(1,10) Test 1025.39 2.59 0.78 2.56 1177.56 0.80
2 NNAR(1,20) Test 1182.16 2.94 0.90 2.94 1385.34 0.81
3 NNAR(1,25) Test 1523.76 3.76 1.16 3.86 1865.49 0.68

The first network is the most accurate. Note that it is also the simplest one in terms of parameters.

plot_forecast(nnForecast$fc)

As expected, the red forecast is closer to the actual data.

3.2.4.2 Re-train of the NNETAR models

Perform cross validation.

nnMod_resample_results = training_results(nnMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 15.861 sec elapsed
plot_training_cv(nnMod_resample_results)

The first model outperforms the other two in most of the splits and metrics.

plot_training_table(nnMod_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 NNAR(1,10) Resamples 5 2583.34 5.43 2.18 5.48 2823.77 0.66
2 NNAR(1,20) Resamples 5 3342.63 7.06 2.91 7.07 3643.16 0.44
3 NNAR(1,25) Resamples 5 3188.16 7.01 2.74 7.31 3654.71 0.40

The simplest neural network obtains the best performance.

It is important to remark that selecting the hyper-parameters of a neural network is a complex and time-consuming process. The number of parameters grows quickly with larger hyper-parameter values, which increases the risk of overfitting, so it is good practice to avoid building an overly complex network.
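As a rough illustration of how fast the parameter count grows, a single-hidden-layer network of the kind nnetar fits, with k hidden units and n_in inputs, has roughly k·(n_in + 1) weights into the hidden layer plus (k + 1) into the output. The input count used below (1 target lag plus the 10 regressors from the recipe) is an assumption; the exact count depends on how nnetar handles the regressors internally, and nnetar averages several such networks.

```r
# Approximate weight count per network in a single-hidden-layer NNAR model (illustration)
nnar_weights = function(n_in, k) k * (n_in + 1) + (k + 1)

n_in = 1 + 10            # assumed: 1 non-seasonal lag + 10 external regressors
nnar_weights(n_in, 10)   # first network, NNAR(1,10)
nnar_weights(n_in, 25)   # third network, NNAR(1,25): more than double the weights
```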

3.3 Comparison of the best machine learning models

Finally the best machine learning models are selected to predict the value of the target.

mlMod_table = modeltime_table(
  wflw_glmnet3,
  wflw_rf2,
  wflw_xgboost1,
  wflw_nnetar1
) 
mlMod_table
## # Modeltime Table
## # A tibble: 4 × 3
##   .model_id .model     .model_desc 
##       <int> <list>     <chr>       
## 1         1 <workflow> GLMNET      
## 2         2 <workflow> RANDOMFOREST
## 3         3 <workflow> XGBOOST     
## 4         4 <workflow> NNAR(1,10)

3.3.1 Forecasting with the best machine learning models

mlForecast = forecast_results(mlMod_table)
plot_forecast_table(mlForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 GLMNET Test 1537.22 3.84 1.17 3.85 1833.10 0.68
2 RANDOMFOREST Test 2241.57 5.73 1.71 5.49 2798.66 0.36
3 XGBOOST Test 1561.01 3.88 1.19 3.83 1877.63 0.66
4 NNAR(1,10) Test 1025.39 2.59 0.78 2.56 1177.56 0.80

The best model is the neural network, followed by the elastic net.

plot_forecast(mlForecast$fc)

As can be seen, both the NNETAR model and the elastic net fit the data very well. Nevertheless, all the models make good predictions with narrow prediction intervals.

3.3.2 Cross validation of the best machine learning models

The cross validation process is repeated.

ml_resample_results = training_results(mlMod_table, splits)
## ── Fitting Resamples ────────────────────────────────────────────
## 
## 16.568 sec elapsed
plot_training_cv(ml_resample_results)

The elastic net is the best in most of the splits and metrics. Note that the behavior of the neural network is quite irregular: in some splits it is very accurate, while in others the error is large.

plot_training_table(ml_resample_results)
Accuracy Table
.model_id .model_desc .type n mae_mean mape_mean mase_mean smape_mean rmse_mean rsq_mean
1 GLMNET Resamples 5 2120.50 4.93 2.02 4.93 2477.99 0.37
2 RANDOMFOREST Resamples 5 2737.18 6.30 2.48 6.14 3307.81 0.22
3 XGBOOST Resamples 5 2878.20 6.14 2.52 6.18 3447.91 0.25
4 NNAR(1,10) Resamples 5 2778.90 5.71 2.33 5.77 2965.17 0.67

As expected, the best result is obtained by the elastic net, whereas the most complex model, the NNETAR network, loses ground when its score is averaged over all the splits.

The machine learning models achieved better forecasting results and cross validation performance than the statistical models.

3.4 Ensembles

As a last point, it is interesting to see if the performance obtained by NNETAR model can be improved by means of ensembles.

If we look again at the forecast table, the second and third best models are the elastic net and the XGBoost, respectively.

plot_forecast_table(mlForecast$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 GLMNET Test 1537.22 3.84 1.17 3.85 1833.10 0.68
2 RANDOMFOREST Test 2241.57 5.73 1.71 5.49 2798.66 0.36
3 XGBOOST Test 1561.01 3.88 1.19 3.83 1877.63 0.66
4 NNAR(1,10) Test 1025.39 2.59 0.78 2.56 1177.56 0.80

3.4.1 Elastic net and XGBoost

The first ensemble is a combination of these two.

idx1 = c(1, 3) # selection of models

mod_ensemble1 = mlForecast$calibration[idx1,] %>%
  ensemble_average(type = "mean")

mod_ensemble1
## ── Modeltime Ensemble ───────────────────────────────────────────
## Ensemble of 2 Models (MEAN)
## 
## # Modeltime Table
## # A tibble: 2 × 5
##   .model_id .model     .model_desc .type .calibration_data
##       <int> <list>     <chr>       <chr> <list>           
## 1         1 <workflow> GLMNET      Test  <tibble [31 × 4]>
## 2         3 <workflow> XGBOOST     Test  <tibble [31 × 4]>

The forecasting process is the same as before.

forecast_ens1 = forecast_results(mod_ensemble1)
plot_forecast_table(forecast_ens1$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ENSEMBLE (MEAN): 2 MODELS Test 1266.32 3.18 0.96 3.17 1480.73 0.77

However, the ensemble does not improve on the NNETAR performance.

plot_forecast(forecast_ens1$fc)

Nevertheless, the fit seems appropriate for the data, and this ensemble could be chosen if a faster and lighter model is needed.

3.4.2 Elastic net and NNETAR

Now the ensemble is the combination of the elastic net model and the neural network.

set.seed(123)
idx2 = c(1, 4) # selection of models

mod_ensemble2 = mlForecast$calibration[idx2,] %>%
  ensemble_average(type = "mean")

mod_ensemble2
## ── Modeltime Ensemble ───────────────────────────────────────────
## Ensemble of 2 Models (MEAN)
## 
## # Modeltime Table
## # A tibble: 2 × 5
##   .model_id .model     .model_desc .type .calibration_data
##       <int> <list>     <chr>       <chr> <list>           
## 1         1 <workflow> GLMNET      Test  <tibble [31 × 4]>
## 2         4 <workflow> NNAR(1,10)  Test  <tibble [31 × 4]>

The accuracy results.

forecast_ens2 = forecast_results(mod_ensemble2)
plot_forecast_table(forecast_ens2$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ENSEMBLE (MEAN): 2 MODELS Test 1040.91 2.6 0.79 2.59 1264.53 0.77

In this case, the ensemble achieves almost the same accuracy as the single neural network.

plot_forecast(forecast_ens2$fc)

3.4.3 XGBoost and NNETAR

Finally, the last ensemble combines the XGBoost model and the neural network.

set.seed(123)
idx3 = c(3, 4) # selection of models

mod_ensemble3 = mlForecast$calibration[idx3,] %>%
  ensemble_average(type = "mean")

mod_ensemble3
## ── Modeltime Ensemble ───────────────────────────────────────────
## Ensemble of 2 Models (MEAN)
## 
## # Modeltime Table
## # A tibble: 2 × 5
##   .model_id .model     .model_desc .type .calibration_data
##       <int> <list>     <chr>       <chr> <list>           
## 1         3 <workflow> XGBOOST     Test  <tibble [31 × 4]>
## 2         4 <workflow> NNAR(1,10)  Test  <tibble [31 × 4]>
forecast_ens3 = forecast_results(mod_ensemble3)
plot_forecast_table(forecast_ens3$calibration)
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ENSEMBLE (MEAN): 2 MODELS Test 937.74 2.35 0.71 2.32 1158.56 0.83

Surprisingly, this ensemble outperforms the previous one, composed of the two best models, and achieves a better result than the single neural network.
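A natural follow-up, not explored here, would be a weighted average that favors the stronger NNETAR model instead of weighting both members equally. modeltime.ensemble supports this through ensemble_weighted; the loadings below are an illustrative choice, not a tuned value.

```r
# Sketch: weight the NNETAR model (second row of idx3) twice as heavily as XGBoost
mod_ensemble_w = mlForecast$calibration[idx3,] %>%
  ensemble_weighted(loadings = c(1, 2), scale_loadings = TRUE)
```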

plot_forecast(forecast_ens3$fc)

As can be seen, the model fits the data very well and the prediction intervals are very narrow.

4. Interpretation and conclusions

After the whole analysis, several conclusions can be drawn about the data set and the models employed.

First, the evaluation procedure strongly influences the model choice. As we have seen, training and testing on a single split yields a performance from which it is easy to determine the optimal model. However, once cross validation is applied and the models are evaluated on the whole set of splits, the decision becomes more difficult.

This behavior is expected, and it is precisely why we apply cross validation in time series: as we provide more data to a model, it may behave better or worse than before. At this point, the decision should be made according to previous experience or to our actual interest, thinking about how much data should be used for training.

The neural network NNETAR, as well as the ensemble composed of this model and the XGBoost, obtains the best results in the test set of the first split, and its forecast is very accurate. However, when the models are re-trained on all the splits, we have seen that the simpler elastic net model obtains a better score. This is the point where we should consider which model we want to use, and whether it is worth using all the data or only the most recent part.

Consulting an expert in the topic would be the optimal approach, to find out whether old fluctuations or any past seasonality may affect the future forecast, in which case we should choose the model that performs best on average rather than only on the most recent data. In the case of bitcoin’s price, where the market is updated constantly, even a daily frequency may be too slow and data collected months ago may already be stale. For this reason, the best approach may be to use the model which behaves best on the most recent data: in our case, the ensemble composed of the NNETAR and the XGBoost.

final_forecast = forecast_results(mod_ensemble3, back=150)
plot_forecast(final_forecast$fc)

Second, consider how the variables are related to the response. Let us look again at the time series plots.

bitcoin_by_var %>% 
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  labs(title = "Daily Values",
       x = "", y="variables") +
  facet_grid(vars(var), scales = "free_y") +
  theme_bw() 

At first glance, the variables trans and trans_cost may have a high correlation with the response price, since their trends are similar. In the case of volume the similarity is less clear. This could be related to the result obtained with the VAR model, in which only those two variables have lagged effects on the target.
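This visual impression can be checked numerically with cross-correlations. A quick sketch, assuming the wide data frame `data` from Section 1 (with the renamed columns) is still available:

```r
# Cross-correlation between price and each candidate predictor (sketch)
ccf(data$trans, data$price, lag.max = 14, main = "trans vs. price")
ccf(data$trans_cost, data$price, lag.max = 14, main = "trans_cost vs. price")
ccf(data$volume, data$price, lag.max = 14, main = "volume vs. price")
```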

Note also the weekly seasonality: the VAR model returns a 7-day lagged effect from price, and the correlograms and the STL decomposition showed that the variables have a weekly seasonal pattern.

In addition, the inclusion of variables improved the performance of the simple ARIMA model, and the dynamic regression with more predictors provided the best statistical performance. For this reason, these features are useful for the prediction.